A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

Authors

  • Farhad Rezvani, Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran
Abstract:

In this paper, a novel filter-based approach is proposed that uses the PageRank algorithm both to select the optimal subset of features and to compute their weights for web page classification. To evaluate the proposed approach, multiple experiments are performed on four datasets (WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups), with accuracy as the main criterion. The results show that classifier accuracy on the WebKB, Reuters-R8, and Reuters-R52 datasets improved significantly, from 91% up to 96%, compared to the best result achieved by other feature selection methods such as information gain (IG) and chi-squared (Chi-2). On the 20NewsGroups dataset, by contrast, accuracy showed no noticeable improvement and remained close to that of most of the compared methods. Overall, the evaluation indicates that the proposed approach obtains higher accuracy than the feature sets selected by the other methods.
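The abstract does not say how the feature graph is built or how the PageRank scores are turned into weights, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that terms are nodes, that two terms are linked when they occur in the same document, and that the PageRank score of a term is used directly as its weight. The function names (`build_cooccurrence_graph`, `pagerank`, `select_features`) and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(docs):
    """Build an undirected weighted term graph (an assumed construction).

    Nodes are terms; the weight of an edge is the number of documents
    in which both terms occur.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        terms = sorted(set(doc))
        for t in terms:
            graph[t]  # register the term even if it has no neighbours
        for a, b in combinations(terms, 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; returns {term: score}."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {t: 1.0 / n for t in nodes}
    out_weight = {t: sum(graph[t].values()) for t in nodes}
    for _ in range(iterations):
        # Rank held by terms without edges is spread uniformly.
        dangling = sum(rank[t] for t in nodes if out_weight[t] == 0.0)
        new_rank = {}
        for t in nodes:
            # The graph is symmetric, so the neighbours of t are also
            # the nodes that pass rank to t.
            incoming = sum(
                rank[u] * w / out_weight[u] for u, w in graph[t].items()
            )
            new_rank[t] = (1.0 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


def select_features(docs, k):
    """Return the k highest-scoring terms as (term, weight) pairs."""
    scores = pagerank(build_cooccurrence_graph(docs))
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    # Toy tokenised documents, made up for illustration only.
    toy_docs = [["web", "page", "link"], ["web", "page"], ["web", "news"]]
    for term, weight in select_features(toy_docs, 2):
        print(term, round(weight, 3))
```

Like any filter method, this sketch scores features without consulting a classifier. The selected terms and their weights would then be passed to whatever classifier is being evaluated, a step the abstract leaves unspecified.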


similar resources

Feature Selection for Web Page Classification

Web page classification is significantly different from traditional text classification because of the presence of some additional information, provided by the HTML structure and by the presence of hyperlinks. In this paper we analyze these peculiarities and try to exploit them for representing web pages in order to improve categorization accuracy. We conduct various experiments on a corpus of ...


A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization

Spam is unwanted email that harms communications around the world. It is a growing problem for personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...


A Novel Approach for Web Page Classification using Optimum features

The boom in the use of the Web and its exponential growth are now well known. The amount of textual data available on the Web is estimated to be on the order of one terabyte, in addition to images, audio and video. This has imposed additional challenges on the Web directories, which help the user to search the Web by classifying selected Web documents into subject. Manual classification of web pag...


Web page feature selection and classification using neural networks

Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web (WWW). In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained by both the principal components and class profile-based features. Each news web page is represented by the term-weighting scheme. As the number of unique words...


Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework

With the increasing amount of web pages over the internet, it has been a major concern to obtain information on the internet accurately at a reasonable cost with decent performance. A potential solution is through the classification of web pages into meaningful categories. An effective classification of web pages is of benefit to various applications such as web mining and search engines. Unlik...


Feature selection using genetic algorithm for classification of schizophrenia using fMRI data

In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...




Journal title

volume 10, issue 4

pages 1-10

publication date 2019-11-01

